Text Manipulation

Introduction

In this chapter, we explore various text manipulation techniques in R. We start by discussing how to handle strings, focusing on printing and combining (pasting) them. Next, we introduce the stringr package, part of the tidyverse family, whose functions make handling text data much easier than in base R. Lastly, we cover how to describe and match patterns in text data using regular expressions.

From Numeric Data to Strings

When working with numeric data, it is quite intuitive to perform operations like addition or multiplication on vectors. However, manipulating strings (or character data), one of the core data types in R, requires specific functions. String manipulation can become complex, especially when combining strings from a single vector or different columns of a data frame. With text data, we can perform tasks such as adding or replacing text, finding matches, counting letters, locating positions of specific text characters, and much more.

Printing Strings

We can use single quotes ('') or double quotes ("") to specify any value as a string. For instance, suppose we want to print one of the best-known phrases in the computer science, data science and data engineering world, "Hello world!". We can print this phrase with the print() function, enclosing the text in single or double quotes:

# Printing with single quotes
print('Hello world!')

[1] "Hello world!"

# Printing with double quotes
print("Hello world!")

[1] "Hello world!"

In both cases, we get exactly the same result. However, what happens if we need double or single quotes within a string? R cannot tell on its own which quotes belong to the text and which ones delimit the string, so we need to mark the quotes that are part of the text. To do so, we use an escape sequence: we place the special character backslash (\) before each single or double quote that we want to include in the string. To display the result, we use the function cat(). The cat() function concatenates its arguments and outputs them as plain text, which makes it more suitable when working with escape sequences or when we want to print text exactly as it reads, without surrounding quotes or backslashes. The print() function, in contrast, shows the string the way we would type it in code, including the surrounding quotes and the escape backslashes:

# Printing "I want to print "Hello World!"" with print()
print("I want to print \"Hello World!\"")

[1] "I want to print \"Hello World!\""

# Printing "I want to print "Hello World!"" with cat()
cat("I want to print \"Hello World!\"")

I want to print "Hello World!"

The print() function displays the string as "I want to print \"Hello World!\"" because it shows both the surrounding double quotes and the backslashes. The cat() function, however, displays I want to print "Hello World!" without the additional characters, making the output cleaner and more readable.

Using cat() is particularly useful when formatting strings that include special characters such as quotes, as it provides more control over how the output appears.
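
As a small complementary sketch (the names s1 and s2 are illustrative and not part of the examples above), the same text can also be written without escaping by enclosing the double quotes in single quotes. Both forms should store identical text, and the backslashes are never part of the stored string:

```r
# Two ways of writing the same string (s1 and s2 are hypothetical names)
s1 <- 'I want to print "Hello World!"'    # single quotes outside, no escaping needed
s2 <- "I want to print \"Hello World!\""  # double quotes outside, inner quotes escaped

# Both definitions produce exactly the same string
identical(s1, s2)

# The escape backslashes are not stored, so they do not count as characters
nchar(s2)

# cat() does not add a line break on its own, so we append "\n" explicitly
cat(s1, "\n", sep = "")
```

Which form to use is mostly a matter of taste; escaping becomes necessary only when a string contains both kinds of quotes.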

Pasting Results

The function paste() can be used when we want to combine two or more string values into a single string. For instance, we can use the paste() function to print "Hello World!" when the two words are separate strings:

# Printing "Hello World" with paste
paste("Hello", "World!")

[1] "Hello World!"

The paste() function combines the two string values into one string, separating them by a space. This occurs because the default separator of the paste() function is a single space. We can use a different separator by changing the argument sep. For example, suppose we want to print the string "Data-Science":

# Printing "Data-Science"
paste("Data", "Science", sep = "-")

[1] "Data-Science"

Things can become complicated when we start including whole character vectors instead of a single string value inside the paste() function. For instance, suppose we have a scalar (one-element vector or just a single value as before) and a vector of two elements ("Science" and "Analytics"):

# Printing with a scalar and a vector
scalar <- "Data"
vector <- c("Science", "Analytics")
paste(scalar, vector, sep = "-")

[1] "Data-Science"   "Data-Analytics"

When we have two vectors of the same length, vectorization takes place, as we would expect with vector inputs. This means that each element of one vector is combined with the corresponding element of the other vector.

# Printing with two vectors
vector1 <- c("Data", "Science")
vector2 <- c("Data", "Analytics")
paste(vector1, vector2, sep = "-")

[1] "Data-Data"         "Science-Analytics"

With vectors that are not of the same length, R will recycle the shorter vector to match the length of the longer one. This means that the shorter vector is repeated until it matches the length of the longer vector, which can sometimes lead to unexpected or undesired results if not used carefully.

# Printing with two vectors of different lengths
vector1 <- c("Data", "Science")
vector2 <- c("Data", "Analytics", "Engineering")
paste(vector1, vector2, sep = "-")

[1] "Data-Data"         "Science-Analytics" "Data-Engineering"
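
To make the recycling step visible, the sketch below repeats the shorter vector explicitly with the base R function rep_len() and checks that pasting gives the same result. The names recycled and same_length are illustrative assumptions. Note that paste() recycles silently, so a length check of this kind is something we would have to add ourselves:

```r
vector1 <- c("Data", "Science")
vector2 <- c("Data", "Analytics", "Engineering")

# rep_len() repeats vector1 until it has as many elements as vector2
recycled <- rep_len(vector1, length.out = length(vector2))
recycled

# Pasting the explicitly recycled vector gives the same result as before
identical(paste(vector1, vector2, sep = "-"),
          paste(recycled, vector2, sep = "-"))

# A simple manual check (our own convention, not a feature of paste()):
# FALSE here signals that recycling will take place
same_length <- length(vector1) == length(vector2)
same_length
```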

If we want to combine all elements into a single string, we can use the argument collapse, which specifies the text placed between the combined elements. For instance, suppose we add to the earlier examples the argument collapse with the value " and " (notice the spaces on both sides of the word):

# Printing with a scalar and a vector
scalar <- "Data"
vector <- c("Science", "Analytics")
paste(scalar, vector, sep = "-", collapse = " and ")

[1] "Data-Science and Data-Analytics"

# Printing with two vectors
vector1 <- c("Data", "Science")
vector2 <- c("Data", "Analytics")
paste(vector1, vector2, sep = "-", collapse = " and ")

[1] "Data-Data and Science-Analytics"
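
The argument collapse also works on a single vector, which is probably its most common use. The following sketch uses a hypothetical vector named skills and illustrates that sep is applied first, element by element, while collapse is applied afterwards across the results:

```r
skills <- c("Science", "Analytics", "Engineering")  # hypothetical example vector

# collapse alone joins the elements of one vector into a single string
paste(skills, collapse = ", ")

# sep combines element by element, then collapse joins the results
paste("Data", skills, sep = " ", collapse = "; ")

# Without collapse we keep one string per element; with collapse we get one string
length(paste("Data", skills))
length(paste("Data", skills, collapse = "; "))
```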

Lastly, a variation of the function paste() is the function paste0(). The difference between these two functions is that the first separates the pieces of text with a space by default, while the second joins them with no separator:

# Printing "Hello World" with paste()
paste("Hello", "World!")

[1] "Hello World!"

# Printing "Hello World" with paste0()
paste0("Hello", "World!")

[1] "HelloWorld!"

When we use the paste() or the paste0() function, it is a good idea to try some prints just to make sure that the output is the expected one; we saw that when we have vectors inside the function, things can become complicated.
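
As a quick check of this kind, the sketch below confirms that paste0() behaves like paste() with an empty separator, and then builds a few labels from a fixed prefix and a numeric vector. The name file_names and the "report_" prefix are purely illustrative:

```r
# paste0() gives the same result as paste() with sep = ""
identical(paste0("Hello", "World!"), paste("Hello", "World!", sep = ""))

# Numbers are converted to text, and the single strings are recycled
file_names <- paste0("report_", 1:3, ".csv")
file_names
```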

The stringr Package

As mentioned at the beginning, the package we use for text manipulation is stringr, which is also loaded with the tidyverse package. Although base R already provides many functions to manipulate strings, the stringr package is often preferable because of its consistent syntax:

Every function starts with str_.
The first argument of every function is always the string (or character vector) on which we want to apply the function.

With RStudio, this is very handy because we do not need to remember all these string manipulation functions: when we type str_, RStudio automatically suggests all available alternatives.

Let’s start by loading the stringr package and creating a vector of character values:

# Libraries
library(stringr)

# Quotes
quotes <- c("Become a Master in Data Science.", 
            "The best way to learn data science is to do data science.", 
            "Text mining is an essential skill.")

With this character vector, we can experiment using the many functions of the stringr package, all with different purposes. For instance, suppose we want to check whether the string "is" exists within each value of the vector. We can do this by using the str_detect() function, since we try to “detect” whether a specific pattern exists within a string:

# Is the pattern in the string?
str_detect(quotes, pattern = "is")

[1] FALSE  TRUE  TRUE

As expected, we get the values FALSE, TRUE and TRUE because the string "is" is found within the second and the third element of the vector but not in the first one.

A related function is str_which(). It works like the function which() from base R and returns the indexes of the elements in which the specified pattern exists:

# Returning the indexes of entries that contain the pattern
str_which(quotes, pattern = "is")

[1] 2 3
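
The relationship between the two functions can be sketched as follows: str_which() returns the positions of the TRUE values produced by str_detect(). The helper name has_is is an assumption made for this illustration:

```r
library(stringr)

quotes <- c("Become a Master in Data Science.",
            "The best way to learn data science is to do data science.",
            "Text mining is an essential skill.")

# Logical vector: one TRUE/FALSE per element of quotes
has_is <- str_detect(quotes, pattern = "is")

# which() applied to the logical vector gives the same indexes as str_which()
identical(which(has_is), str_which(quotes, pattern = "is"))

# The logical vector can be used to filter the original vector
quotes[has_is]

# Summing the logical vector counts the elements that contain the pattern
sum(has_is)
```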

Regarding subsetting strings, the functions str_sub() and str_subset() can be used. The first extracts part of each string based on specified positions, while the second keeps only the elements of the vector that contain a specified pattern. The example below shows how they work and helps us understand the difference between the two:

# Extracting the first 6 characters
str_sub(quotes, start = 1, end = 6)

[1] "Become" "The be" "Text m"

# Returning the subset of the strings that contains the string "Master"
str_subset(quotes, pattern = "Master")

[1] "Become a Master in Data Science."
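
A further sketch of the difference, assuming the same quotes vector: str_sub() also accepts negative positions, which count backwards from the end of each string, and it always returns one result per input element, whereas str_subset() may return fewer elements than it received:

```r
library(stringr)

quotes <- c("Become a Master in Data Science.",
            "The best way to learn data science is to do data science.",
            "Text mining is an essential skill.")

# Negative positions count from the end: the last 8 characters of each string
str_sub(quotes, start = -8, end = -1)

# str_sub() keeps the length of the input vector
length(str_sub(quotes, start = 1, end = 6))

# str_subset() keeps only the matching elements
length(str_subset(quotes, pattern = "Master"))
```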

If we want to see where a specific pattern occurs within a string, the function str_view() highlights the matches and shows only the elements that contain the pattern:

# Emphasizing word "is"
str_view(quotes, pattern = "is")

[2] │ The best way to learn data science <is> to do data science.
[3] │ Text mining <is> an essential skill.

With the function str_split(), we can split each string into pieces at every occurrence of a specified pattern. The result is a list with one element per input string. In the example below, the first string does not contain the pattern and stays whole, while the second and third strings are each split into two pieces (the pattern itself is dropped):

# Splitting the quotes to create a list
str_split(quotes, pattern = "is")

[[1]]
[1] "Become a Master in Data Science."

[[2]]
[1] "The best way to learn data science " " to do data science."               

[[3]]
[1] "Text mining "         " an essential skill."
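
Because the result is a list, we typically need list indexing to reach the individual pieces. The following sketch (the name parts is illustrative) shows how to count and access the pieces, and how the argument simplify = TRUE returns a character matrix instead of a list:

```r
library(stringr)

quotes <- c("Become a Master in Data Science.",
            "The best way to learn data science is to do data science.",
            "Text mining is an essential skill.")

parts <- str_split(quotes, pattern = "is")

# Number of pieces obtained from each string
lengths(parts)

# First piece of the second string: double brackets select the list element
parts[[2]][1]

# simplify = TRUE returns a matrix, padding shorter rows with empty strings
str_split(quotes, pattern = "is", simplify = TRUE)
```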

We see that, in practice, the functions found in stringr are simple and effective. There are many other functions in stringr, but the table below provides an overview of the ones most commonly used, listing the name of each function together with a short description. It is advisable to come back and check this table when we want to solve a task that involves strings.

Function	Description
str_detect()	Is the pattern in the string?
str_which()	Return the indexes of entries that contain the pattern
str_sub()	Extract the characters based on specified positions (e.g. from 1 to 6)
str_subset()	Return the subset of the strings that contain the pattern (e.g. "Master")
str_replace()	Replace the first match of the pattern in a string with another string
str_replace_all()	Replace all matches of the pattern in a string with another string
str_locate()	Return positions of the first occurrence of the specified pattern
str_locate_all()	Return positions of all occurrences of the specified pattern
str_to_upper()	Change all characters to upper case letters
str_to_lower()	Change all characters to lower case letters
str_to_title()	Change the first character of each word to upper case and the rest to lower case
str_length()	Number of characters in a string
str_count()	Count the number of times a pattern appears in a string
str_replace_na()	Replace all NAs with a specified value
str_trim()	Remove white space at the start and at the end of a string
str_sort()	Sort the vector in alphabetical order
str_order()	Return the indexes that would sort the vector in alphabetical order
str_trunc()	Truncate a string to a maximum width (the three dots count toward the width)
str_c()	Join strings
str_view_all()	Emphasize all the parts of a string that match the pattern (deprecated since stringr 1.5.0, where str_view() highlights all matches)
str_split()	Split a string into a list with its parts separated by the pattern
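
To give a feel for a few of these functions, here is a short sketch that applies some of them to the quotes vector created earlier. The selection of functions and the replacement text "DATA" are arbitrary choices for illustration:

```r
library(stringr)

quotes <- c("Become a Master in Data Science.",
            "The best way to learn data science is to do data science.",
            "Text mining is an essential skill.")

# Matching is case sensitive: "Data" in the first quote is not counted
str_count(quotes, pattern = "data")

# Replace only the first match, then all matches, in the second quote
str_replace(quotes[2], pattern = "data", replacement = "DATA")
str_replace_all(quotes[2], pattern = "data", replacement = "DATA")

# Start and end position of the first "is" in the third quote
str_locate(quotes[3], pattern = "is")

# Number of characters in each quote
str_length(quotes)

# Remove surrounding white space and join strings
str_trim("  Data Science  ")
str_c("Data", "Science", sep = " ")
```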

Regular Expressions

In R, regular expressions are pattern-matching tools that provide a concise and flexible syntax for specifying search patterns in text data. Put simply, we use regular expressions to describe patterns in strings (Friedl, 2006). To understand what this means and how we can use regular expressions, we will use the string "Data!" with the str_detect() function that we discussed previously:

# Checking regular expression for "Data!"
str_detect("Data!", pattern = "^....!")

[1] TRUE

What exactly is this pattern? We matched the string "Data!" using a sequence of special characters. The special character caret (^) anchors the pattern to the start of the string; it does not itself stand for any character. Then, we used the special character dot (.) 4 times because a dot matches any single character (not only letters). Since the word "Data" contains 4 letters, 4 dots capture it. Lastly, we included the exclamation mark (!), which is not a special character and simply matches itself, because it appears in our string. As a result, we described the pattern of the string "Data!" fully, and that is why we got TRUE as an output. It is important to understand that the exact same regular expression also describes similar strings such as "Math!" or "Stat!", as the pattern is exactly the same (4 characters, followed by an exclamation mark (!)):

# Checking regular expression for "Math!"
str_detect("Math!", pattern = "^....!")

[1] TRUE

# Checking regular expression for "Stat!"
str_detect("Stat!", pattern = "^....!")

[1] TRUE
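
Two details of this pattern are worth probing with a small sketch. First, the dot is not limited to letters. Second, the pattern has no end anchor, so it also matches longer strings that merely begin with four characters and an exclamation mark; adding the end-of-string anchor ($) makes it stricter. The test strings below are made up for illustration:

```r
library(stringr)

# The dot matches any character, including digits
str_detect("1234!", pattern = "^....!")

# Without an end anchor, extra characters after the "!" are allowed
str_detect("Data!!!", pattern = "^....!")

# With $ the string must end right after the "!"
str_detect("Data!!!", pattern = "^....!$")

# Too short: only three characters before the "!"
str_detect("Dat!", pattern = "^....!")
```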

That is the difference between using a regular expression and using the exact value of a string as a pattern. Had we used the value "Data!" in the argument pattern, we would of course get TRUE in the first example, but FALSE in the other two. Because the values in a vector often share a general structure rather than identical text, it is very useful to be able to describe that structure as a pattern:

# Checking regular expression for "Data!" with the pattern "Data!"
str_detect("Data!", pattern = "Data!")

[1] TRUE

# Checking regular expression for "Math!" with the pattern "Data!"
str_detect("Math!", pattern = "Data!")

[1] FALSE

# Checking regular expression for "Stat!" with the pattern "Data!"
str_detect("Stat!", pattern = "Data!")

[1] FALSE

The main question that arises, of course, is: what is the point of learning regular expressions? Regular expressions are very useful when it comes to text data manipulation. For instance, suppose we have a vector that describes the body weight of 5 people:

# Creating a vector
body_weight <- c("75 KG", "82 KG", "85 KG", "68 KG", "79 KG")

# Printing body_weight 
body_weight

[1] "75 KG" "82 KG" "85 KG" "68 KG" "79 KG"

# Printing the class of body_weight
class(body_weight)

[1] "character"

Since we have text in this vector, the data type of our created vector is character. In practice though, we would probably want to separate the numeric values from the character values within each element of the vector body_weight in order to perform analysis on the numeric values. In other words, there is no need to have the string "KG" in our vector. By describing the pattern in the function str_remove() from the stringr package, we can remove those strings using regular expressions. The pattern " ..$" describes a space followed by any two characters at the end of the string ($):

# Removing the unit: a space and two characters at the end of each string
body_weight <- str_remove(string = body_weight, pattern = " ..$")

# Printing body_weight 
body_weight

[1] "75" "82" "85" "68" "79"

# Printing the class of body_weight
class(body_weight)

[1] "character"
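
Note that the class is still character after removing the unit. A minimal sketch of the remaining step, assuming we want numeric values for analysis, converts the cleaned strings with as.numeric(). The names weights_chr and weights_num are illustrative, and the more explicit pattern " KG$" is used here instead of " ..$":

```r
library(stringr)

body_weight <- c("75 KG", "82 KG", "85 KG", "68 KG", "79 KG")

# Remove the literal unit " KG" at the end of each string
weights_chr <- str_remove(body_weight, pattern = " KG$")

# Convert the remaining text to numbers
weights_num <- as.numeric(weights_chr)
class(weights_num)

# Numeric operations are now possible
mean(weights_num)
```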

This simple example illustrates the value of regular expressions. However, regular expressions can be very confusing, especially when we are working with more complicated strings. For this reason, we will learn how to create regular expressions with the rebus package, which contains functions that help us construct regular expressions more easily, in a way closer to human language. Although not part of the tidyverse, this package combines well with the stringr package. To see how this works, let’s load the rebus package and use its syntax to describe the same string "Data!" (with the str_detect() function):

# Library
library(rebus)

# Checking regular expression for "Data!" with a plain regular expression
str_detect("Data!", pattern = "^....!")

[1] TRUE

# Checking regular expression for "Data!" with rebus
str_detect("Data!",
           pattern = START %R% ANY_CHAR %R% 
             ANY_CHAR %R% ANY_CHAR %R% ANY_CHAR %R% "!")

[1] TRUE

The pattern is much more readable when written with the rebus package. We begin with START, then use ANY_CHAR 4 times because the string "Data" consists of 4 characters, and finally add an exclamation mark in double quotes. The special operator %R% can be read as "followed by" or "then". With this syntax, we describe the whole pattern of the string "Data!". We could, however, describe this string in many other ways, such as the ones in the following example:

# "Data!": the pattern is that the character value starts with any character
str_detect("Data!", pattern = START %R% ANY_CHAR)

[1] TRUE

# "Data!": the pattern is that the character value ends with exclamation mark
str_detect("Data!", pattern = "!" %R% END)

[1] TRUE

In both cases, we get the value TRUE. It is therefore important to understand that there is no need to describe the whole pattern every time. How we describe a pattern really depends on the underlying data. Especially with regular expressions, it is important to practice and try to understand the output that we get when we use a specific pattern. In our previous example, if we set the pattern to START %R% one_or_more(DGT), we get the value FALSE. The reason is that the string "Data!" does not start with one or more digits; with the string "4-Data!", we would get the value TRUE:

# Describing "Data!"
str_detect("Data!", pattern = START %R% one_or_more(DGT))

[1] FALSE

# Describing "4-Data!"
str_detect("4-Data!", pattern = START %R% one_or_more(DGT))

[1] TRUE
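
The same rebus pattern can be reused with other stringr functions. As a sketch, assuming the original body weight vector with its "KG" units, the pattern below (stored under the illustrative name leading_digits) is passed to str_extract() to pull out the leading digits instead of removing the unit:

```r
library(stringr)
library(rebus)

body_weight <- c("75 KG", "82 KG", "85 KG", "68 KG", "79 KG")

# One or more digits at the start of the string
leading_digits <- START %R% one_or_more(DGT)

# Printing the object shows the regular expression that rebus generated
leading_digits

# Every element starts with digits
str_detect(body_weight, pattern = leading_digits)

# Extract the digits and convert them to numbers
as.numeric(str_extract(body_weight, pattern = leading_digits))
```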

Now that we have discussed the intuition behind regular expressions and the rebus package, we can focus on the more technical details. The table below summarizes the syntax used in the rebus package to represent different patterns. It is important to note that the rebus syntax is not itself a regular expression; rebus generates the regular expression for us from more readable building blocks.

Rebus	Description
START	Start of a string
END	End of a string
ANY_CHAR	Any single character
optional()	Optional pattern (zero or one occurrence)
zero_or_more()	Zero or more occurrences
one_or_more()	One or more occurrences
repeated()	Pattern repeated a specified number of times
or()	Choice among alternatives
char_class()	Any character within a specified set
negated_char_class()	Any character NOT in a specified set
CARET	A literal caret (^)
DOLLAR	A literal dollar sign ($)
DOT	A literal dot (.)
DGT	Any digit
WRD	Any word character (letter, digit or underscore)
SPC	Any whitespace character
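
A brief sketch of two entries from the table, using made-up input values: or() lets us accept several alternatives (here two hypothetical units), and DOT matches a literal dot, in contrast to ANY_CHAR, which matches any character. The vector measurements and the name unit_pattern are assumptions for illustration only:

```r
library(stringr)
library(rebus)

measurements <- c("75 KG", "160 LB", "1.8 M")  # hypothetical values

# The string must end with either "KG" or "LB"
unit_pattern <- or("KG", "LB") %R% END
str_detect(measurements, pattern = unit_pattern)

# DOT matches only a literal dot at the end of the string
str_detect(c("Science.", "Data!"), pattern = DOT %R% END)

# ANY_CHAR matches whatever character comes last
str_detect(c("Science.", "Data!"), pattern = ANY_CHAR %R% END)
```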